Abstract

Nitrotyrosine is one of the post-translational modifications (PTMs) in proteins that occurs when their tyrosine residue is 
nitrated. Compared with healthy people, a remarkably increased level of nitrotyrosine is detected in those suffering from 
rheumatoid arthritis, septic shock, and coeliac disease. Given an uncharacterized protein sequence that contains many 
tyrosine residues, which of them can be nitrated and which cannot? This is a challenging problem, directly relevant 
both to an in-depth understanding of the PTM mechanism and to nitrotyrosine-based drug development. In particular, 
with the avalanche of protein sequences generated in the postgenomic age, a high-throughput tool for this purpose is 
highly desirable. Here, a new predictor called "iNitro-Tyr" was developed by incorporating the position-specific 
dipeptide propensity into the general pseudo amino acid composition for discriminating nitrotyrosine sites 
from non-nitrotyrosine sites in proteins. Rigorous jackknife tests demonstrated that the new predictor not only 
yields a higher success rate but is also much more stable and less noisy. A web-server for iNitro-Tyr is accessible to the 
public at http://app.aporc.org/iNitro-Tyr/. For the convenience of most experimental scientists, we have further provided a 
protocol of step-by-step guide, by which users can easily get their desired results without the need to follow the 
complicated mathematics that were presented in this paper just for the integrity of its development process. It has not 
escaped our notice that the approach presented here can be also used to deal with the other PTM sites in proteins. 
Introduction

As one of the post-translational modifications (PTMs) of proteins, 
nitrotyrosine is a product of tyrosine nitration mediated by reactive 
nitrogen species such as peroxynitrite anion and nitrogen dioxide 
(Fig. 1). Compared with the fluids from healthy people, a 
remarkably increased level of nitrotyrosine is detected in those 
suffering from rheumatoid arthritis, septic shock, and coeliac 
disease. Accordingly, knowledge of nitrotyrosine sites in proteins is 
very useful for both basic research and drug development. Although 
conventional experimental methods did provide useful insight into 
the biological roles of tyrosine nitration [1—3], it is time-consuming 
and expensive to determine the nitrotyrosine sites based on the 
experimental approach alone. Particularly, identification of endog- 
enous 3-NTyr modifications remains largely elusive (see, e.g., [4-7]). 
With the avalanche of protein sequences generated in the 
postgenomic age, it is highly desirable to develop computational 
methods for identifying the nitrotyrosine sites in proteins. The 
present study was initiated in an attempt to propose a new method 
for identifying the nitrotyrosine sites in proteins, in the hope that it can 
play a complementary role to the existing methods in this area. 

As summarized in [8] and demonstrated in a series of recent 
publications [9-21], to establish a really useful statistical predictor 
for a biological system, we need to consider the following 
procedures: (i) construct or select a valid benchmark dataset to 
train and test the predictor; (ii) formulate the biological samples 
with an effective mathematical expression that can truly capture 
their essence and intrinsic correlation with the target to be 
predicted; (iii) introduce or develop a powerful algorithm (or 
engine) to operate the prediction; (iv) properly perform cross-validation 
tests to objectively evaluate the anticipated accuracy; (v) 
establish a user-friendly web-server that is accessible to the public. 
Below, let us describe how to deal with these steps one by one. 

Materials and Methods 

1. Benchmark Dataset 

To develop a statistical predictor, it is fundamentally important 
to establish a reliable and stringent benchmark dataset to train and 
test the predictor. If the benchmark dataset contains some errors, 
the predictor trained by it must be unreliable and the accuracy 
tested by it would be completely meaningless. 

For facilitating description later, let us adopt Chou's peptide 
formulation, which was used for studying HIV protease cleavage 
sites [22,23], specificity of GalNAc-transferase [24], and signal 
peptide cleavage sites [25]. According to Chou's scheme, a 
potential nitrotyrosine peptide, i.e., a peptide with Tyr (namely Y) 
located at its center (Fig. 2), can be expressed as 

$$
P_\xi(\mathrm{Y}) = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1} \, \mathrm{Y} \, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \qquad (1)
$$

where the subscript $\xi$ is an integer, $R_{-\xi}$ represents the $\xi$-th 
upstream amino acid residue from the center, $R_{+\xi}$ the $\xi$-th 
downstream amino acid residue, and so forth. A $(2\xi+1)$-tuple 
peptide $P_\xi(\mathrm{Y})$ can be further classified into the following 
categories: 

$$
P_\xi(\mathrm{Y}) \in
\begin{cases}
P_\xi^{+}(\mathrm{Y}), & \text{if its center is a nitrotyrosine site} \\
P_\xi^{-}(\mathrm{Y}), & \text{otherwise}
\end{cases}
\qquad (2)
$$

where $P_\xi^{+}(\mathrm{Y})$ represents a true nitrotyrosine peptide, $P_\xi^{-}(\mathrm{Y})$ a 
false nitrotyrosine peptide, and $\in$ represents "a member of" in 
set theory. 

As pointed out by a comprehensive review [26], there is no need 
to separate a benchmark dataset into a training dataset and a 
testing dataset for examining the performance of a prediction 
method if it is tested by the jackknife test or subsampling (K-fold) 
cross-validation test. Thus, the benchmark dataset for the current 
study can be formulated as 

$$
S_\xi = S_\xi^{+} \cup S_\xi^{-} \qquad (3)
$$

where $S_\xi^{+}$ only contains the samples of $P_\xi^{+}(\mathrm{Y})$, i.e., the 
nitrotyrosine peptides; $S_\xi^{-}$ only contains the samples of $P_\xi^{-}(\mathrm{Y})$, 
i.e., the non-nitrotyrosine peptides (cf. Eq. 2); and $\cup$ represents the 
symbol for "union" in set theory. 

Since the length of the peptide $P_\xi(\mathrm{Y})$ is $2\xi+1$ (Eq. 1), the 
benchmark dataset with different values of $\xi$ will contain peptides 
of different numbers of amino acid residues, as formulated in Eq. 4. 



The detailed procedures to construct $S_\xi$ are as follows. (i) Its 
elements were derived from the same 546 source proteins used 
in [27], which contain 1,044 nitrotyrosine sites (see columns 1 and 2 
of Supporting Information S1). (ii) Slide a flexible window of 
$2\xi+1$ amino acids (Fig. 3) along each of the 546 protein 
sequences taken from the UniProt database (version 2014_01). (iii) 
Collect only those peptide segments with Y (tyrosine) at the center. 
(iv) If the number of upstream or downstream residues in a protein was 
less than $\xi$, each lacking residue was filled with a dummy residue "X" [28]. (v) 
The peptide samples thus obtained were put into the positive 
subset $S_\xi^{+}$ if their centers have been experimentally confirmed as 
nitrotyrosine sites; otherwise, into the negative subset $S_\xi^{-}$. 
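
Steps (ii) to (v) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names `extract_tyr_peptides` and `split_by_label` are hypothetical, positions are taken to be 1-based, and padding both termini with `xi` dummy residues is one simple way to realize step (iv).

```python
def extract_tyr_peptides(sequence, xi=9, pad="X"):
    """Return (position, peptide) pairs for every Y in `sequence`.

    `position` is 1-based (an assumption); each peptide has 2*xi + 1
    residues with Y at its center, padded with the dummy residue where
    the protein terminus is closer than xi residues.
    """
    padded = pad * xi + sequence + pad * xi
    peptides = []
    for i, residue in enumerate(sequence):
        if residue == "Y":
            # Residue i of `sequence` sits at index i + xi of `padded`,
            # so the slice [i, i + 2*xi] is centered on it.
            peptides.append((i + 1, padded[i:i + 2 * xi + 1]))
    return peptides


def split_by_label(peptides, nitrated_positions):
    """Split peptides into positive and negative subsets (step v)."""
    positive = [p for pos, p in peptides if pos in nitrated_positions]
    negative = [p for pos, p in peptides if pos not in nitrated_positions]
    return positive, negative
```

For example, with `xi=2` the toy sequence `MYAAY` yields the two 5-tuple peptides `XMYAA` and `AAYXX`, centered on positions 2 and 5.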

By following the aforementioned procedures, five such benchmark 
datasets ($S_{\xi=6}$, $S_{\xi=7}$, $S_{\xi=8}$, $S_{\xi=9}$, and $S_{\xi=10}$) were 
constructed. Each of these datasets contained 1,044 nitrotyrosine 
peptides and 7,669 non-nitrotyrosine peptides. Note that the 
sample numbers thus obtained differ slightly from 
those in [27]. This is because some proteins originally used in [27] 
have been removed or replaced in the updated version of the 
UniProt database. 

However, it was observed via preliminary trials that when $\xi = 9$, 
i.e., the peptide samples concerned were formed by 19 residues, 
the corresponding results were most promising (see Fig. 4 and 
Fig. 5). Accordingly, we chose $S_{\xi=9}$ as the benchmark dataset 
for further investigation. Thus, Eq. 3 can be reduced to 

$$
S = S^{+} \cup S^{-} \qquad (5)
$$

where $S = S_9$, $S^{+} = S_9^{+}$ containing 1,044 nitrotyrosine peptide 
samples, and $S^{-} = S_9^{-}$ containing 7,669 non-nitrotyrosine 
peptide samples. The detailed 19-tuple peptide sequences and 
their positions in proteins are given in Supporting Information S1. 



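$$
S_\xi \text{ contains the peptides of }
\begin{cases}
13 \text{ residues}, & \text{when } \xi = 6 \\
15 \text{ residues}, & \text{when } \xi = 7 \\
17 \text{ residues}, & \text{when } \xi = 8 \\
19 \text{ residues}, & \text{when } \xi = 9 \\
21 \text{ residues}, & \text{when } \xi = 10
\end{cases}
\qquad (4)
$$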



2. Feature Vector and Pseudo Amino Acid Composition 

One of the most important but also most difficult problems in 
computational biology today is how to effectively formulate a 
biological sequence with a discrete model or a vector, yet still keep 
considerable sequence order information. This is because all the 
existing operation engines, such as correlation angle approach 
[29], covariance discriminant [30], neural network [31], support 
vector machine (SVM) [32], random forest [33], conditional 
random field [28], K-nearest neighbor (KNN) [34], OET-KNN 
[35], Fuzzy K-nearest neighbor [36], ML-KNN algorithm [37], 
and SLLE algorithm [30], can only handle vector but not 
sequence samples. However, a vector defined in a discrete model 
may totally miss the sequence-order information. To deal with 
such a dilemma, the approach of pseudo amino acid composition 
[38] or Chou's PseAAC [39] was proposed. Ever since it was 
introduced in 2001 [38], the concept of PseAAC has rapidly 
penetrated into almost all the areas of computational proteomics, 
such as identifying bacterial virulent proteins [40], predicting 
anticancer peptides [41], predicting protein subcellular location 
[42], predicting membrane protein types [43], analyzing genetic 
sequence [44], predicting GABA(A) receptor proteins [45], 
identifying antibacterial peptides [46], identifying allergenic 
proteins [47], predicting metalloproteinase family [48], 
identifying GPCRs and their types [49], and identifying protein 
quaternary structural attributes [50], 
among many others (see a long list of references cited in a 2014 
article [51]). Recently, the concept of PseAAC was further 
extended to represent the feature vectors of DNA and nucleotides 
[9], as well as other biological samples (see, e.g., [52]). Because it 
has been widely and increasingly used, three types of 
powerful open access software, called 'PseAAC-Builder' [53], 
'propy' [54], and 'PseAAC-General' [51], were recently established: the 
former two are for generating various modes of Chou's special 
PseAAC, while the third is for those of Chou's general PseAAC. 

Figure 1. A schematic drawing to show protein nitrotyrosine. 
doi:10.1371/journal.pone.0105018.g001 

Figure 2. An illustration to show Chou's scheme for a peptide of $(2\xi+1)$ residues with tyrosine (Y) at the center. Adapted from Chou 
[55,76] with permission. 
doi:10.1371/journal.pone.0105018.g002 

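According to a comprehensive review [8], PseAAC can be 
generally formulated as 

$$
\mathbf{P} = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_\Omega]^{\mathbf{T}} \qquad (6)
$$

where $\mathbf{T}$ is the transpose operator, while $\Omega$ is an integer to reflect the 
vector's dimension. The value of $\Omega$ as well as the components 
$\psi_u$ $(u = 1, 2, \cdots, \Omega)$ in Eq. 6 will depend on how to extract the 
desired information from a protein/peptide sequence. Below, let 
us describe how to extract the useful information from the 
benchmark datasets to define the peptide samples via Eq. 6. 

For convenience in formulation, let us rewrite Eq. 1 as follows 

$$
\mathbf{P} = R_1 R_2 \cdots R_\xi R_{\xi+1} R_{\xi+2} \cdots R_{2\xi} R_{2\xi+1} \qquad (7)
$$

where $R_{\xi+1}$, the residue at the center of the peptide, is tyrosine 
(Y), and all the other residues $R_i$ $(i \neq \xi+1)$ can be any of the 20 
native amino acids or the dummy code X as defined above. 
Hereafter, let us use the numerical codes 1, 2, 3, ..., 20 to 
represent the 20 native amino acids according to the alphabetic 
order of their single letter codes, and use 21 to represent the 
dummy amino acid X. Accordingly, the number of possible 
different dipeptides will be $21 \times 21 = 441$, and the number of 
dipeptide subsite positions on the sequence of Eq. 7 will be 
$(2\xi+1-1) = 2\xi$. 

Now, let us introduce a positive and a negative PSDP (position-specific 
dipeptide propensity) matrix, as given below 

$$
\mathbb{Z}^{+}(\xi) =
\begin{bmatrix}
z^{+}_{1,1} & z^{+}_{1,2} & \cdots & z^{+}_{1,2\xi} \\
z^{+}_{2,1} & z^{+}_{2,2} & \cdots & z^{+}_{2,2\xi} \\
\vdots & \vdots & \ddots & \vdots \\
z^{+}_{441,1} & z^{+}_{441,2} & \cdots & z^{+}_{441,2\xi}
\end{bmatrix}
\qquad (8a)
$$

and 

$$
\mathbb{Z}^{-}(\xi) =
\begin{bmatrix}
z^{-}_{1,1} & z^{-}_{1,2} & \cdots & z^{-}_{1,2\xi} \\
z^{-}_{2,1} & z^{-}_{2,2} & \cdots & z^{-}_{2,2\xi} \\
\vdots & \vdots & \ddots & \vdots \\
z^{-}_{441,1} & z^{-}_{441,2} & \cdots & z^{-}_{441,2\xi}
\end{bmatrix}
\qquad (8b)
$$

where the element 

$$
\begin{cases}
z^{+}_{i,j} = F^{+}(D_i \mid j) \\
z^{-}_{i,j} = F^{-}(D_i \mid j)
\end{cases}
\quad (i = 1, 2, \cdots, 441; \; j = 1, 2, \cdots, 2\xi) \qquad (9)
$$

$$
D_1 = \mathrm{AA}, \; D_2 = \mathrm{AC}, \; D_3 = \mathrm{AD}, \; \cdots, \; D_{440} = \mathrm{XY}, \; D_{441} = \mathrm{XX} \qquad (10)
$$

In Eq. 9, $F^{+}(D_i \mid j)$ is the occurrence frequency of the $i$-th 
dipeptide $(i = 1, 2, \cdots, 441)$ at the $j$-th subsite on the sequence of 
Eq. 7 (or the $j$-th column in the positive subset dataset $S^{+}$) that 
can be easily derived using the method described in [55] from the 
sequences in the Supporting Information S1; while $F^{-}(D_i \mid j)$ is 
the corresponding occurrence frequency but derived from the 
negative subset dataset $S^{-}$. Thus, for the peptide sequence of Eq. 
7, its attribute to the positive set $S^{+}$ or negative set $S^{-}$ can be 
formulated by a $2\xi$-D (dimension) vector $\mathbf{P}^{+}$ or $\mathbf{P}^{-}$, as defined by 
[23] 

$$
\mathbf{P}^{+} = [\psi_1^{+} \;\; \psi_2^{+} \;\; \cdots \;\; \psi_u^{+} \;\; \cdots \;\; \psi_{2\xi}^{+}]^{\mathbf{T}} \qquad (11a)
$$

$$
\mathbf{P}^{-} = [\psi_1^{-} \;\; \psi_2^{-} \;\; \cdots \;\; \psi_u^{-} \;\; \cdots \;\; \psi_{2\xi}^{-}]^{\mathbf{T}} \qquad (11b)
$$

where 

$$
\psi_u^{+} =
\begin{cases}
z^{+}_{1,u} & \text{when } R_u R_{u+1} = \mathrm{AA} \\
z^{+}_{2,u} & \text{when } R_u R_{u+1} = \mathrm{AC} \\
\;\vdots & \\
z^{+}_{21,u} & \text{when } R_u R_{u+1} = \mathrm{AX} \\
z^{+}_{22,u} & \text{when } R_u R_{u+1} = \mathrm{CA} \\
\;\vdots & \\
z^{+}_{441,u} & \text{when } R_u R_{u+1} = \mathrm{XX}
\end{cases}
\quad (u = 1, 2, \cdots, 2\xi) \qquad (12a)
$$

$$
\psi_u^{-} =
\begin{cases}
z^{-}_{1,u} & \text{when } R_u R_{u+1} = \mathrm{AA} \\
z^{-}_{2,u} & \text{when } R_u R_{u+1} = \mathrm{AC} \\
\;\vdots & \\
z^{-}_{21,u} & \text{when } R_u R_{u+1} = \mathrm{AX} \\
z^{-}_{22,u} & \text{when } R_u R_{u+1} = \mathrm{CA} \\
\;\vdots & \\
z^{-}_{441,u} & \text{when } R_u R_{u+1} = \mathrm{XX}
\end{cases}
\quad (u = 1, 2, \cdots, 2\xi) \qquad (12b)
$$

where $R_u$ and $R_{u+1}$ represent the residues in the $u$-th and 
$(u+1)$-th positions of the peptide concerned. 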
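
The construction of the PSDP matrices and of the vectors in Eqs. 11-12 can be illustrated with a short Python sketch. It is not the authors' implementation: the names `ALPHABET`, `DIPEPTIDE_INDEX`, `psdp_matrix`, and `feature_vector` are hypothetical, indices are 0-based (so AA is 0 and XX is 440), and "occurrence frequency" is assumed here to mean the fraction of peptides in a subset that carry a given dipeptide at a given subsite; the exact procedure of [55] may differ.

```python
from itertools import product

# 20 native residues in alphabetical order of their one-letter codes,
# followed by the dummy residue X (numerical code 21 in the text).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

# 0-based index of each of the 21 x 21 = 441 dipeptides: AA, AC, ..., XX.
DIPEPTIDE_INDEX = {a + b: k for k, (a, b) in enumerate(product(ALPHABET, repeat=2))}


def psdp_matrix(peptides):
    """Build a 441 x 2*xi matrix of dipeptide frequencies per subsite.

    Assumption: the frequency is the fraction of `peptides` that have
    the dipeptide at that subsite.
    """
    n_sites = len(peptides[0]) - 1  # 2*xi dipeptide subsites
    matrix = [[0.0] * n_sites for _ in range(len(DIPEPTIDE_INDEX))]
    for peptide in peptides:
        for j in range(n_sites):
            matrix[DIPEPTIDE_INDEX[peptide[j:j + 2]]][j] += 1.0 / len(peptides)
    return matrix


def feature_vector(peptide, matrix):
    """For each subsite u, look up the entry for the dipeptide R_u R_{u+1}."""
    return [matrix[DIPEPTIDE_INDEX[peptide[u:u + 2]]][u]
            for u in range(len(peptide) - 1)]
```

Calling `feature_vector` once with the matrix built from the positive subset and once with the matrix built from the negative subset gives the two vectors of Eqs. 11a and 11b for the same query peptide.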



3. Discriminant Function Approach 

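Now in the $2\xi$-D space, let us define an ideal nitrotyrosine 
peptide $\Pi^{+}$ [22] and an ideal non-nitrotyrosine peptide $\Pi^{-}$ as 
expressed by 

$$
\Pi^{+} =
\begin{bmatrix}
\Lambda_1^{+} \\ \Lambda_2^{+} \\ \vdots \\ \Lambda_{2\xi}^{+}
\end{bmatrix},
\qquad
\Pi^{-} =
\begin{bmatrix}
\Lambda_1^{-} \\ \Lambda_2^{-} \\ \vdots \\ \Lambda_{2\xi}^{-}
\end{bmatrix}
\qquad (13)
$$

where $\Lambda_i^{+}$ $(i = 1, 2, \cdots, 2\xi)$ is the upper limit of the corresponding 
matrix element in Eq. 12a, and $\Lambda_i^{-}$ $(i = 1, 2, \cdots, 2\xi)$ is the upper 
limit of the corresponding matrix element in Eq. 12b. Theoretically 
speaking, each of these hypothetical upper limits in Eq. 13 
should be 1 [23]. Thus, the similarity score of $\mathbf{P}^{+}$ with $\Pi^{+}$ and 
that of $\mathbf{P}^{-}$ with $\Pi^{-}$ can be defined as 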



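$$
\mathbf{P}^{+} \cdot \Pi^{+} = \sum_{u=1}^{2\xi} \psi_u^{+} \Lambda_u^{+},
\qquad
\mathbf{P}^{-} \cdot \Pi^{-} = \sum_{u=1}^{2\xi} \psi_u^{-} \Lambda_u^{-}
\qquad (14)
$$

Similar to the treatment in [23], let us define a discriminant 
function $\Delta_\xi$ given by 

$$
\Delta_\xi = \left( \mathbf{P}^{+} \cdot \Pi^{+} - \mathbf{P}^{-} \cdot \Pi^{-} \right) - \Re
= \sum_{u=1}^{2\xi} \left( \psi_u^{+} - \psi_u^{-} \right) - \Re
\qquad (15)
$$

where $\Re$ is the adjust parameter used to optimize the overall 
success rate when the positive and negative benchmark datasets 
are highly unbalanced in size. Now the peptide $\mathbf{P}_\xi$ of Eq. 7 can be 
identified according to the following rule 

$$
\begin{cases}
\mathbf{P}_\xi \text{ belongs to nitrotyrosine peptide}, & \text{if } \Delta_\xi > 0 \\
\mathbf{P}_\xi \text{ belongs to non-nitrotyrosine peptide}, & \text{if } \Delta_\xi < 0
\end{cases}
\qquad (16)
$$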
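
A minimal sketch of the decision rule follows. It rests on two assumptions that should be read as such: the similarity scores are plain dot products with ideal vectors whose components all equal 1 (the theoretical upper limit stated above), so each score reduces to a sum of the components; and a score of exactly zero is treated as negative, which the rule itself does not specify. The function names are hypothetical, and the default value 0.70 for the adjust parameter is the one quoted in the footnote of Table 1.

```python
def discriminant(psi_pos, psi_neg, adjust=0.70):
    """Difference of the two similarity scores minus the adjust parameter.

    psi_pos / psi_neg are the vectors of Eqs. 11a and 11b for one peptide.
    Assumes every ideal upper limit equals 1, so each score is a plain sum.
    """
    return sum(p - n for p, n in zip(psi_pos, psi_neg)) - adjust


def is_nitrotyrosine(psi_pos, psi_neg, adjust=0.70):
    """True if the peptide is predicted as a nitrotyrosine peptide."""
    # Assumption: a discriminant of exactly 0 is treated as negative.
    return discriminant(psi_pos, psi_neg, adjust) > 0
```

A larger adjust parameter makes the predictor more conservative, trading sensitivity for specificity, which is why it is useful when the negative subset is much larger than the positive one.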



The predictor obtained via the above procedures is called iNitro-Tyr. 
How to properly and objectively evaluate the anticipated accuracy of a 
new predictor and how to make it easily accessible and user-friendly are 
the two key issues that will have important impacts on its application 
value [56] . Below, let us address these problems. 



Figure 3. Illustration to show the peptide segment highlighted by sliding the scaled window $[-\xi, +\xi]$ along a protein sequence. 
During the sliding process, the scales on the window are aligned with different amino acids so as to define different peptide segments. When, and 
only when, the scale 0 is aligned with Y (tyrosine), is the $(2\xi+1)$-tuple peptide segment seen within the window regarded as a potential 
nitrotyrosine peptide. Adapted from Chou [55,77] with permission. 
doi:10.1371/journal.pone.0105018.g003 

Figure 4. A sequence logo plot to show the difference between the positive and negative peptides (upper panel: positive peptide frequency plot; lower panel: negative peptide frequency plot). The window's size is 19 when $\xi = 9$. 
See Eq. 1 and the legend of Fig. 3 for further explanation. 
doi:10.1371/journal.pone.0105018.g004 



Results and Discussion 

1. Metrics for Scoring Prediction Quality 

In the literature, the following four metrics are often used to score 
the quality of a predictor from four different angles 

$$
\begin{cases}
\mathrm{Sn} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \\[2ex]
\mathrm{Sp} = \dfrac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \\[2ex]
\mathrm{Acc} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \\[2ex]
\mathrm{MCC} = \dfrac{(\mathrm{TP} \times \mathrm{TN}) - (\mathrm{FP} \times \mathrm{FN})}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}
\end{cases}
\qquad (17)
$$

where TP represents the number of the true positive; TN, the 
number of the true negative; FP, the number of the false positive; 
FN, the number of the false negative; Sn, the sensitivity; Sp, the 
specificity; Acc, the accuracy; MCC, the Matthews correlation 
coefficient. To most biologists, unfortunately, the four metrics as 
formulated in Eq. 17 are not quite intuitive and easy to 
understand, particularly the equation for MCC. Here let us adopt 
the formulation proposed recently in [9,11,28] based on the symbols 
introduced by Chou [25,55] in predicting signal peptides. According 
to the formulation, the same four metrics can be expressed as 



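$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}}, & 0 \le \mathrm{Sn} \le 1 \\[2ex]
\mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}}, & 0 \le \mathrm{Sp} \le 1 \\[2ex]
\mathrm{Acc} = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, & 0 \le \mathrm{Acc} \le 1 \\[2ex]
\mathrm{MCC} = \dfrac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}, & -1 \le \mathrm{MCC} \le 1
\end{cases}
\qquad (18)
$$

where $N^{+}$ is the total number of the nitrotyrosine peptides 
investigated while $N^{+}_{-}$ is the number of the nitrotyrosine peptides 
incorrectly predicted as the non-nitrotyrosine peptides; $N^{-}$ is the total 
number of the non-nitrotyrosine peptides investigated while $N^{-}_{+}$ is the 
number of the non-nitrotyrosine peptides incorrectly predicted as 
the nitrotyrosine peptides [57]. 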

Now, it is crystal clear from Eq. 18 that when $N^{+}_{-} = 0$, 
meaning none of the nitrotyrosine peptides was incorrectly 
predicted to be a non-nitrotyrosine peptide, we have the sensitivity 
Sn = 1. When $N^{+}_{-} = N^{+}$, meaning that all the nitrotyrosine 
peptides were incorrectly predicted as the non-nitrotyrosine 
peptides, we have the sensitivity Sn = 0. Likewise, when $N^{-}_{+} = 0$, 
meaning none of the non-nitrotyrosine peptides was incorrectly 
predicted to be the nitrotyrosine peptide, we have the specificity 
Sp = 1; whereas when $N^{-}_{+} = N^{-}$, meaning all the non-nitrotyrosine 
peptides were incorrectly predicted as the nitrotyrosine peptides, 
we have the specificity Sp = 0. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning 
that none of the nitrotyrosine peptides in the positive dataset $S^{+}$ and 
none of the non-nitrotyrosine peptides in the negative dataset $S^{-}$ 
was incorrectly predicted, we have the overall accuracy Acc = 1 
and MCC = 1; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that all 
the nitrotyrosine peptides in the positive dataset $S^{+}$ and all the 
non-nitrotyrosine peptides in the negative dataset $S^{-}$ were 
incorrectly predicted, we have the overall accuracy Acc = 0 and 
MCC = -1; whereas when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$ we 
have Acc = 0.5 and MCC = 0, meaning no better than random 
prediction. As we can see from the above discussion based on Eq. 
18, the meanings of sensitivity, specificity, overall accuracy, and 
Matthews correlation coefficient have become much more 
intuitive and easier to understand. 
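
The equivalence of the two formulations can be checked numerically. The sketch below uses hypothetical function names and relies on the correspondence that the total number of positives is TP + FN, the number of missed positives is FN, the total number of negatives is TN + FP, and the number of negatives wrongly called positive is FP. Neither function guards against a zero denominator, which occurs when every sample is assigned to the same class.

```python
from math import sqrt


def metrics_eq17(tp, tn, fp, fn):
    """Sn, Sp, Acc, MCC from the confusion-matrix counts (Eq. 17)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sn, sp, acc, mcc


def metrics_eq18(n_pos, n_neg, pos_missed, neg_missed):
    """Same four metrics from the counts used in Eq. 18.

    n_pos, n_neg: total positive / negative samples investigated.
    pos_missed:   positives incorrectly predicted as negative.
    neg_missed:   negatives incorrectly predicted as positive.
    """
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_missed / n_neg
    acc = 1 - (pos_missed + neg_missed) / (n_pos + n_neg)
    mcc = (1 - (pos_missed / n_pos + neg_missed / n_neg)) / sqrt(
        (1 + (neg_missed - pos_missed) / n_pos)
        * (1 + (pos_missed - neg_missed) / n_neg)
    )
    return sn, sp, acc, mcc
```

For instance, `metrics_eq17(80, 150, 50, 20)` and `metrics_eq18(100, 200, 20, 50)` describe the same confusion matrix and agree to within floating-point rounding.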

It is instructive to point out, however, that the set of metrics in Eqs. 
17-18 is valid only for single-label systems. For multi-label 
systems, such as those for the subcellular localization of multiplex 
proteins (see, e.g., [58-62]), where a protein may have two or more 
locations, and those for the functional types of antimicrobial 
peptides (see, e.g., [63]), where a peptide may possess two or more 
functional types, a completely different set of metrics is needed, as 
elaborated in [37]. 

Figure 5. A plot to show the different ROC curves obtained by the 10-fold cross-validation under different $\xi$ values (horizontal axis: 1 - specificity). As we can see, 
when $\xi = 9$, the corresponding AUC (i.e., the area under its curve) is the largest, meaning the most promising compared with the other values of $\xi$. 
doi:10.1371/journal.pone.0105018.g005 

2. Jackknife Cross-Validation 

With a set of clear and valid metrics as defined in Eq. 18 to 
measure the quality of a predictor, the next thing we need to 
consider is how to objectively derive the values of these metrics for 
a predictor. 

In statistical prediction, the following three cross-validation 
methods are often used to calculate the metrics of Eq. 18 for 
evaluating the quality of a predictor: independent dataset test, 
subsampling test, and jackknife test [64]. However, of the three test 
methods, the jackknife test is deemed the least arbitrary because it 
always yields a unique result for a given benchmark dataset [65]. 
The reasons are as follows, (i) For the independent dataset test, 
although all the samples used to test the predictor are outside the 
training dataset used to train it so as to exclude the "memory" 
effect or bias, the way of how to select the independent samples to 
test the predictor could be quite arbitrary unless the number of 
independent samples is sufficiently large. This kind of arbitrariness 
might result in completely different conclusions. For instance, a 
predictor achieving a higher success rate than the other predictor 
for a given independent testing dataset might fail to keep so when 
tested by another independent testing dataset [64]. (ii) For the 
subsampling test, the concrete procedure usually used in the literature 
is the 5-fold, 7-fold or 10-fold cross-validation. The problem with 
this kind of subsampling test is that the number of possible 
selections in dividing a benchmark dataset is an astronomical 
figure even for a very simple dataset, as demonstrated by Eqs. 28-30 
in [8]. Therefore, in any actual subsampling cross-validation 
test, only an extremely small fraction of the possible selections is 
taken into account. Since different selections may lead to 
different results even for the same benchmark dataset and the same 
predictor, the subsampling test cannot avoid the arbitrariness 
either. A test method unable to yield a unique outcome cannot be 
deemed a good one. (iii) In the jackknife test, all the samples in 
the benchmark dataset will be singled out one-by-one and tested 
by the predictor trained by the remaining samples. During the 
process of jackknifing, both the training dataset and testing dataset 
are actually open, and each sample will be in turn moved between 
the two. The jackknife test can exclude the "memory" effect. Also, 
the arbitrariness problem as mentioned above for the independent 
dataset test and subsampling test can be avoided because the 
outcome obtained by the jackknife cross-validation is always 
unique for a given benchmark dataset. Accordingly, the jackknife 
test has been increasingly used and widely recognized by 
investigators to examine the quality of various predictors (see, 
e.g., [33,41,43,45-47,66-72]). 
Table 1. Comparison of the new iNitro-Tyr predictor with the existing predictors in identifying the nitrotyrosine sites; the rates 
listed below were derived by the jackknife cross-validation on the 546 source proteins used in [27]. 

Predictor        Threshold   Acc (%)   MCC      Sn (%)   Sp (%)
GPS-YNO2 (a)     High        82.57     0.1884   28.89    90.02
                 Medium      79.60     0.2171   40.53    85.02
                 Low         76.51     0.2335   50.09    90.18
iNitro-Tyr (b)               84.52     0.4905   81.76    85.89

(a) As reported in [27], where $\xi = 7$, i.e., the length of the potential nitrotyrosine peptides considered is $(2\xi + 1) = 15$. 

(b) See Eqs. 15-16, where $\Re = 0.70$ and $\xi = 9$, i.e., the length of the potential nitrotyrosine peptides considered is $(2\xi + 1) = 19$. 

doi:10.1371/journal.pone.0105018.t001 



Accordingly, in this study we also used the jackknife cross- 
validation method to calculate the metrics in Eq. 18 although it 
would take more computational time. 
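
The jackknife (leave-one-out) procedure itself is independent of the predictor being tested. The sketch below shows the generic loop together with a deliberately simple stand-in classifier; `jackknife`, `train_nearest_mean`, and `predict_nearest_mean` are hypothetical names, and the nearest-mean classifier is only a toy used to exercise the loop, not the iNitro-Tyr engine.

```python
def jackknife(samples, labels, train, predict):
    """Leave-one-out test: each sample is predicted by a model that was
    trained on all the remaining samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def train_nearest_mean(xs, ys):
    """Toy model: the mean value of each class."""
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members)
    return means


def predict_nearest_mean(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))
```

Because every sample is left out exactly once and no random partition is involved, repeated runs on the same benchmark dataset give the same predictions, which is the uniqueness property discussed above. The cost is one training run per sample.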

3. Comparison with Other Methods 

The jackknife test results by iNitro-Tyr on the benchmark 
dataset $S = S^{+} \cup S^{-}$ (cf. Supporting Information S1) for the 
four metrics defined in Eq. 18 are listed in Table 1, where for 
facilitating comparison, the corresponding results by GPS-YNO2 
[27] with different thresholds are also given. 

From the table, we can see the following facts. (i) The overall 
accuracy by the current iNitro-Tyr predictor is Acc = 84.52%, 
which is higher than the overall accuracy by GPS-YNO2 
regardless of what threshold is used for the latter. (ii) The Matthews 
correlation coefficient obtained by iNitro-Tyr is MCC = 0.4905, 
which is significantly higher than that by GPS-YNO2, indicating 
that the new predictor is more stable and less noisy. (iii) The 
sensitivity and specificity obtained by iNitro-Tyr are Sn = 81.76% 
and Sp = 85.89%, which are much more evenly distributed than 
those by the GPS-YNO2 predictor. 

It is instructive to point out that, as shown by Eqs. 12a and 12b, 
the amino acid pairwise coupling effects [11] have been incorporated 
via the general form of PseAAC [8] to formulate the peptide 
samples. If, however, we just used the single amino acid specific 
position occurrence frequency to formulate the peptide samples, 
the corresponding prediction quality would drop down to 
Acc = 44.88% and MCC = 0.1656, clearly indicating that consideration 
of the amino acid pairwise coupling effects could 
significantly enhance the prediction quality, fully consistent with 
the reports by previous investigators [73,74], where it was 
observed that the prediction of protein secondary structural 
contents had been remarkably improved by taking into account 
the amino acid pairwise coupling effects. 

Accordingly, compared with the best of existing predictors for 
identifying the nitrotyrosine sites in proteins, the new iNitro-Tyr 
predictor not only can yield higher or comparable accuracy, but is 
also much more stable and less noisy. It is anticipated that iNitro- 
Tyr may become a useful high throughput tool in this area, or at 
the very least play a complementary role to the existing predictors. 

4. Web-Server and User Guide 

For the convenience of most experimental scientists, we have 
established a web-server for the iNitro-Tyr predictor, with which 
users can easily get their desired results according to the steps 
below without the need to understand the mathematical equations 
in the method section. 

Step 1. Open the web server at http://app.aporc.org/iNitro-Tyr/ 
and you will see the top page of the predictor on your 
computer screen, as shown in Fig. 6. Click on the Read Me 
button to see a brief introduction about the iNitro-Tyr predictor and 
the caveat when using it. 

Step 2. Either type or copy/paste the sequences of query 
proteins into the input box shown at the center of Fig. 6. All the 
input sequences should be in the FASTA format. A sequence in 
FASTA format consists of a single initial line beginning with the 
symbol ">" in the first column, followed by lines of sequence data 
in which amino acids are represented using single-letter codes. 
Except for the mandatory symbol ">", all the other characters in 
the single initial line are optional and only used for the purpose of 
identification and description. The sequence ends if another line 
starting with the symbol ">" appears; this indicates the start of 
another sequence. Example sequences in FASTA format can be 
seen by clicking on the Example button right above the open box. 
Note that your input protein sequences should be formed by the 
20 native amino acid codes (ACDEFGHIKLMNPQRSTVWY). 
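
Before submitting a large batch, it can help to check locally that the input follows the format described in Step 2. The sketch below is a stand-alone illustration and has no connection to the server's own code; `parse_fasta` and `invalid_residues` are hypothetical helper names, and converting sequences to upper case is an assumption made for convenience.

```python
# The 20 native amino acid codes accepted as input.
VALID_CODES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted characters that are not native amino acid codes."""
    return sorted(set(sequence) - VALID_CODES)
```

A record whose `invalid_residues` list is non-empty contains characters outside the 20 native codes and should be corrected before submission.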

Step 3. Click on the Submit button to see the predicted 
results. For example, if you use the two query protein sequences in 
the Example window as the input, after clicking the Submit 
button, you will see the following on your screen. (i) The 1st protein 
(P05181) contains 18 Y residues, of which only those located at 
sequence positions 71, 318, 349, 381, and 423 are predicted to be 
nitrotyrosine sites, while all the others are predicted to be 
non-nitrotyrosine sites. (ii) The 2nd 
protein (P03023) contains 8 Y residues, of which only those located 
at sequence positions 7, 12, 17, and 47 are predicted to be 
nitrotyrosine sites, while all the others are predicted to be 
non-nitrotyrosine sites. All these results are fully consistent with 
experimental observations except for one Y residue at position 349 in the 
1st protein (P05181), which is actually a non-nitrotyrosine site but was 
overpredicted as a nitrotyrosine site. 

Step 4. As shown on the lower panel of Fig. 6, you may also 
submit your query proteins in an input file (with FASTA format) 
via the "Browse" button. To see the sample of input file, click on 
the Example button right under the input box. 

Step 5. Click on the Data button to download the benchmark 
dataset used to train and test the iNitro-Tyr predictor. 

Conclusions 

As one of the important posttranslational modifications (PTMs), 
nitrotyrosine is a product occurring in proteins when their tyrosine 
(Tyr or Y) residue is nitrated. Since a remarkably increased level 
of nitrotyrosine is detected in patients who suffer 
from rheumatoid arthritis, septic shock, and coeliac disease, 
knowledge of nitrotyrosine is very useful for developing drugs 
against these diseases. 

A new predictor was developed for identifying the nitrotyrosine 
sites in proteins based on a set of 19-tuple peptides generated as 
follows. A window of 19 amino acids was slid along each of the 546 
protein sequences taken from a protein database, and only those 
peptide segments with Y (tyrosine) at the center, i.e., the 
potential nitrotyrosine-site-containing peptides, were collected. The benchmark 
dataset thus obtained contains 1,044 experiment-confirmed 
nitrotyrosine peptides and 7,669 non-nitrotyrosine peptides. 

Figure 6. A semi-screenshot to show the top page of the iNitro-Tyr server. Its website address is at http://app.aporc.org/iNitro-Tyr/. 
doi:10.1371/journal.pone.0105018.g006 

The new predictor is called iNitro-Tyr, in which each of the 
potential nitrotyrosine-site-containing peptides was formulated 
with an 18-D vector formed by incorporating the position-specific 
dipeptide propensity (PSDP) into the general form [8] of pseudo 
amino acid composition [38,75] or Chou's PseAAC [39,51,54]. 

It has been observed by the rigorous cross validations that the 
iNitro-Tyr not only yields higher success rates but also is more 
stable and less noisy as reflected by a set of four metrics generally 
used to measure the quality of a predictor from different angles. 

For the convenience of most experimental scientists, the web-server 
of iNitro-Tyr has been established at http://app.aporc.org/iNitro-Tyr/. 
Furthermore, to maximize their convenience, a step-by-step 
guide has been provided, by which users can easily get 
their desired results without the need to follow the complicated 
mathematics that were presented in this paper just for the integrity 
of the predictor. 

It has not escaped our notice that the current approach can also 
be used to develop various effective methods for identifying 
other PTM sites in proteins. 



Supporting Information 

Supporting Information S1 The benchmark dataset used 
in this study contains 8,713 peptides formed by 19 
amino acid residues with Y (tyrosine) at the center. Of 
these peptides, 1,044 are of nitrotyrosine and 7,669 of 
non-nitrotyrosine. Listed are also the codes of the source proteins from 
which these 19-tuple peptide sequences are derived as well as their 
corresponding sites in proteins. See the main text for further 
explanation. 
(DOC) 